The Design of Arbitrage-Free Data Pricing Schemes
Motivated by a growing market that involves buying and selling data over the
web, we study pricing schemes that assign value to queries issued over a
database. Previous work studied pricing mechanisms that compute the price of a
query by extending a data seller's explicit prices on certain queries, or
investigated the properties that a pricing function should exhibit without
detailing a generic construction. In this work, we present a formal framework
for pricing queries over data that allows the construction of general families
of pricing functions, with the main goal of avoiding arbitrage. We consider two
types of pricing schemes: instance-independent schemes, where the price depends
only on the structure of the query, and answer-dependent schemes, where the
price also depends on the query output. Our main result is a complete
characterization of the structure of pricing functions in both settings, by
relating it to properties of a function over a lattice. We use our
characterization, together with information-theoretic methods, to construct a
variety of arbitrage-free pricing functions. Finally, we discuss various
tradeoffs in the design space and present techniques for efficient computation
of the proposed pricing functions.
Comment: full paper
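The arbitrage condition described above can be illustrated with a toy check (this sketches the property the paper studies, not its lattice-based construction; the price list and determinacy relation are hypothetical inputs):

```python
def is_arbitrage_free(prices, determinacies):
    """Toy check of the bundle-arbitrage condition: if a set of queries
    jointly determines another query, buying the set must not be cheaper
    than buying that query directly.

    prices:        {query_name: price}
    determinacies: iterable of (bundle, q) pairs, meaning the queries in
                   `bundle` together determine the answer to query q.
    """
    for bundle, q in determinacies:
        bundle_price = sum(prices[b] for b in bundle)
        if bundle_price < prices[q]:
            return False  # arbitrage: derive q from the cheaper bundle
    return True
```

For example, if queries A and B together determine C, then pricing A at 5, B at 3, and C at 10 admits arbitrage, while pricing C at 7 does not.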
The Fine-Grained Complexity of CFL Reachability
Many problems in static program analysis can be modeled as the context-free
language (CFL) reachability problem on directed labeled graphs. The CFL
reachability problem can be generally solved in time $O(n^3)$, where $n$ is the
number of vertices in the graph, with some specific cases that can be solved
faster. In this work, we ask the following question: given a specific CFL, what
is the exact exponent in the monomial $n^k$ of the running time? In other words, for
which cases do we have linear, quadratic or cubic algorithms, and are there
problems with intermediate runtimes? This question is inspired by recent
efforts to classify classic problems in terms of their exact polynomial
complexity, known as {\em fine-grained complexity}. Although recent efforts
have shown some conditional lower bounds (mostly for the class of combinatorial
algorithms), a general picture of the fine-grained complexity landscape for CFL
reachability is missing.
Our main contribution is lower bound results that pinpoint the exact running
time of several classes of CFLs or specific CFLs under widely believed lower
bound conjectures (Boolean Matrix Multiplication and $k$-Clique). We
particularly focus on the family of Dyck-$k$ languages (which are strings with
well-matched parentheses), a fundamental class of CFL reachability problems. We
present new lower bounds for the case of sparse input graphs where the number
of edges $m$ is the input parameter, a common setting in the database
literature. For this setting, we show a cubic lower bound for Andersen's
Pointer Analysis, which significantly strengthens previously known results.
Comment: Appeared in POPL 2023. Please note the erratum on the first page
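For readers unfamiliar with the problem, the classic cubic algorithm the abstract refers to can be sketched for Dyck-1 (one parenthesis type): derive facts $S(a,b)$ meaning "some path from $a$ to $b$ spells a well-matched string", using the grammar $S \to (\,S\,) \mid S\,S \mid \varepsilon$. This is a standard textbook worklist formulation, not code from the paper:

```python
from collections import deque

def dyck1_reachability(n, edges):
    """All-pairs Dyck-1 reachability on a directed graph with edges
    labeled '(' or ')'.  Worklist saturation of the grammar
    S -> ( S ) | S S | eps; cubic in the worst case."""
    open_out = [[] for _ in range(n)]   # a -'('-> b
    close_out = [[] for _ in range(n)]  # c -')'-> d
    for a, lbl, b in edges:
        (open_out if lbl == '(' else close_out)[a].append(b)

    S = set()                           # derived facts S(a, b)
    work = deque()

    def add(a, b):
        if (a, b) not in S:
            S.add((a, b))
            work.append((a, b))

    for v in range(n):                  # S -> eps
        add(v, v)
    while work:
        a, b = work.popleft()
        # S -> S S : concatenate with already-derived facts on either side
        for (x, y) in list(S):
            if y == a:
                add(x, b)
            if x == b:
                add(a, y)
        # S -> ( S ) : wrap S(a, b) in a matching parenthesis pair
        for p in range(n):
            if a in open_out[p]:        # p -'('-> a
                for q in close_out[b]:  # b -')'-> q
                    add(p, q)
    return S
```

On the graph 0 → 1 labeled '(' and 1 → 2 labeled ')', the algorithm derives $S(0,2)$ but not $S(0,1)$, since "(" alone is not well-matched.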
Ranked Enumeration of Conjunctive Query Results
We study the problem of enumerating answers of Conjunctive Queries ranked according to a given ranking function. Our main contribution is a novel algorithm with small preprocessing time, logarithmic delay, and non-trivial space usage during execution. To allow for efficient enumeration, we exploit certain properties of ranking functions that frequently occur in practice. To this end, we introduce the notions of decomposable and compatible (w.r.t. a query decomposition) ranking functions, which allow for partial aggregation of tuple scores in order to efficiently enumerate the output. We complement the algorithmic results with lower bounds that justify why restrictions on the structure of ranking functions are necessary. Our results extend and improve upon a long line of work that has studied ranked enumeration from both a theoretical and practical perspective.
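The idea of a decomposable ranking function admitting logarithmic delay can be illustrated on the simplest case, a ranked Cartesian product under a sum score, via a priority queue (a toy sketch of the general technique, not the paper's algorithm):

```python
import heapq

def ranked_pairs(xs, ys):
    """Enumerate all pairs (x, y) in increasing order of x + y with
    logarithmic delay per answer, assuming xs and ys are sorted
    ascending.  The sum is a decomposable ranking function: each side's
    score can be aggregated independently, so a frontier of candidate
    index pairs in a heap suffices."""
    if not xs or not ys:
        return
    heap = [(xs[0] + ys[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        score, i, j = heapq.heappop(heap)
        yield score, xs[i], ys[j]
        # Push the two frontier successors of (i, j).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(xs) and nj < len(ys) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (xs[ni] + ys[nj], ni, nj))
```

Each `yield` costs one heap pop and at most two pushes, i.e. logarithmic delay in the frontier size.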
General Space-Time Tradeoffs via Relational Queries
In this paper, we investigate space-time tradeoffs for answering Boolean
conjunctive queries. The goal is to create a data structure in an initial
preprocessing phase and use it for answering (multiple) queries. Previous work
has developed data structures that trade off space usage for answering time and
has proved conditional space lower bounds for queries of practical interest
such as the path and triangle query. However, most of these results cater only
to those specific queries, lack a comprehensive framework, and are not generalizable.
The isolated treatment of these queries also fails to utilize the connections
with extensive research on related problems within the database community. The
key insight in this work is to exploit the formalism of relational algebra by
casting the problems as answering join queries over a relational database.
Using the notion of Boolean {\em adorned queries} and {\em access patterns}, we
propose a unified framework that captures several widely studied algorithmic
problems. Our main contribution is three-fold. First, we present an algorithm
that recovers existing space-time tradeoffs for several problems. The algorithm
is based on an application of the {\em join size bound} to capture the space
usage of our data structure. We combine our data structure with {\em query
decomposition} techniques to further improve the tradeoffs and show that it is
readily extensible to queries with negation. Second, we falsify two proposed
conjectures in the existing literature related to the space-time lower bound
for path queries and triangle detection for which we show unexpectedly better
algorithms. This result opens a new avenue for improving several algorithmic
results that have so far been assumed to be (conditionally) optimal. Finally,
we prove new conditional space-time lower bounds for star and path queries.
Comment: Appeared in WADS 2023. Comments and suggestions are always welcome
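The flavor of the space-time tradeoffs discussed above can be shown on a Boolean 2-path query with access pattern "given (u, v), is there a w with edges u→w and w→v?". The following heavy/light sketch is a standard illustration of the tradeoff idea, not the paper's framework; the `threshold` knob trades index size against answering time:

```python
from collections import defaultdict

def build_two_path_index(edges, threshold):
    """Precompute 2-step reachability only for 'heavy' sources
    (out-degree > threshold); answer light sources at query time by
    scanning their short adjacency lists.  Raising `threshold` shrinks
    the index but slows down light-source queries."""
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)
    # Materialize vertices reachable in exactly two steps from heavy sources.
    heavy = {}
    for u, succs in out.items():
        if len(succs) > threshold:
            heavy[u] = set()
            for w in succs:
                heavy[u] |= out.get(w, set())

    def query(u, v):
        if u in heavy:
            return v in heavy[u]                                 # O(1) lookup
        return any(v in out.get(w, ()) for w in out.get(u, ()))  # short scan
    return query
```

With edges {1→2, 2→3, 1→4, 4→3} and threshold 1, vertex 1 is heavy and its 2-step set {3} is materialized, while queries from light vertices fall back to a scan bounded by the threshold.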
Enumeration Algorithms for Conjunctive Queries with Projection
We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which may be of independent interest. In particular, we design combinatorial algorithms that provide instance-specific delay guarantees in linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving a tradeoff between preprocessing time and delay guarantees for enumeration of path queries that contain projections. A CQ with projection where the join attribute is projected away is equivalent to Boolean matrix multiplication. Our results can therefore also be interpreted as sparse, output-sensitive matrix multiplication with delay guarantees.
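The equivalence to Boolean matrix multiplication noted above is concrete: computing $\pi_{x,z}(R(x,y) \bowtie S(y,z))$ produces exactly the nonzero entries of the Boolean product of the adjacency matrices of R and S. A minimal sparse sketch:

```python
from collections import defaultdict

def project_join(R, S):
    """pi_{x,z}(R(x,y) join S(y,z)) as a sparse Boolean matrix product:
    the output pairs are exactly the nonzero entries of the product of
    R's and S's Boolean adjacency matrices."""
    by_y = defaultdict(set)
    for x, y in R:
        by_y[y].add(x)          # index R on the (projected-away) join attribute
    return {(x, z) for y, z in S for x in by_y.get(y, ())}
```

This runs in time proportional to the number of joining pairs, which is why output-sensitive and delay-guaranteed variants of the problem are interesting.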
Space-Time Tradeoffs for Conjunctive Queries with Access Patterns
In this paper, we investigate space-time tradeoffs for answering conjunctive
queries with access patterns (CQAPs). The goal is to create a space-efficient
data structure in an initial preprocessing phase and use it for answering
(multiple) queries in an online phase. Previous work has developed data
structures that trade off space usage for answering time for queries of
practical interest, such as the path and triangle query. However, these
approaches lack a comprehensive framework and are not generalizable. Our main
contribution is a general algorithmic framework for obtaining space-time
tradeoffs for any CQAP. Our framework builds upon the PANDA algorithm and
tree decomposition techniques. We demonstrate that our framework captures all
state-of-the-art tradeoffs that were independently produced for various
queries. Further, we show surprising improvements over the state-of-the-art
tradeoffs known in the existing literature for reachability queries.
Rapidash: Efficient Constraint Discovery via Rapid Verification
Denial Constraint (DC) is a well-established formalism that captures a wide
range of commonly encountered integrity constraints, including candidate keys,
functional dependencies, and ordering constraints, among others. Given their
significance, there has been considerable research interest in achieving fast
verification and discovery of exact DCs within the database community. Despite
the significant advancements in the field, prior work exhibits notable
limitations when confronted with large-scale datasets. The current
state-of-the-art exact DC verification algorithm demonstrates a quadratic
(worst-case) time complexity relative to the dataset's number of rows. In the
context of DC discovery, existing methodologies rely on a two-step algorithm
that commences with an expensive data structure-building phase, often requiring
hours to complete even for datasets containing only a few million rows.
Consequently, users are left without any insights into the DCs that hold on
their dataset until this lengthy building phase concludes. In this paper, we
introduce Rapidash, a comprehensive framework for DC verification and
discovery. Our work makes a dual contribution. First, we establish a connection
between orthogonal range search and DC verification. We introduce a novel exact
DC verification algorithm that demonstrates near-linear time complexity,
representing a theoretical improvement over prior work. Second, we propose an
anytime DC discovery algorithm that leverages our novel verification algorithm
to gradually provide DCs to users, eliminating the need for the time-intensive
building phase observed in prior work. To validate the effectiveness of our
algorithms, we conduct extensive evaluations on four large-scale production
datasets. Our results reveal that our DC verification algorithm achieves up to
40 times faster performance than state-of-the-art approaches.
Comment: comments and suggestions are welcome
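The gap between quadratic pairwise checking and near-linear verification can be seen on a single ordering DC. The sketch below is a toy stand-in for range-search-based verification (it is not the Rapidash algorithm): for the DC "no two tuples s, t with s.A < t.A and s.B > t.B", sort by A once and sweep, instead of comparing all pairs:

```python
from itertools import groupby

def violates_order_dc(rows):
    """Near-linear check of the ordering DC
        'no two tuples s, t with s.A < t.A and s.B > t.B'
    over rows given as (A, B) pairs.  After sorting by A, a violation
    exists iff some A-group's maximum B exceeds the minimum B among
    tuples with strictly larger A (the naive check is quadratic)."""
    rows = sorted(rows)
    groups = [(a, [b for _, b in grp])
              for a, grp in groupby(rows, key=lambda r: r[0])]
    min_b_to_right = float('inf')   # min B over groups with larger A
    for a, bs in reversed(groups):
        if max(bs) > min_b_to_right:
            return True             # found s.A < t.A with s.B > t.B
        min_b_to_right = min(min_b_to_right, min(bs))
    return False
```

Grouping by A keeps ties correct, since the DC only fires on strictly increasing A; the whole check is dominated by the sort.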
Holistic Cube Analysis: A Query Framework for Data Insights
We present Holistic Cube Analysis (HoCA), a framework that augments the
capabilities of relational queries for data insights. We first define
AbstractCube, a data type defined as a function from RegionFeatures space to
relational tables. AbstractCube provides a logical form of data for HoCA
operators and their compositions to operate on to analyze the data. This
function-as-data modeling allows us to simultaneously capture a space of
non-uniform tables on the co-domain of the function, and region space structure
on the domain of the function. We describe two HoCA operators, cube crawling
and cube join, which are cube-to-cube transformations (i.e., higher-order
functions). Cube crawling explores a region subspace, and outputs a cube
mapping regions to signal vectors. Cube join, in turn, allows users to meld
information in different cubes, which is critical for composition. The cube
crawling interface introduces two novel features: (1) Region Analysis Models
(RAMs), which allow one to program and organize analysis on a set of data
features into a module. (2) Multi-Model Crawling, which allows one to apply
multiple models, potentially on different feature sets, during crawling. These
two features, together with cube join and a rich RAM library, allow us to
construct succinct HoCA programs to capture a wide variety of data-insight
problems in system monitoring, experimentation analysis, and business
intelligence. HoCA poses a rich algorithmic design space, such as optimizing
crawling performance leveraging region space structure, optimizing cube join
performance, and physical designs of cubes. We describe several cube crawling
implementations leveraging different foundations (an in-house relational query
engine, and Apache Beam), and evaluate their performance characteristics.
Finally, we discuss avenues in extending the framework, such as devising more
useful HoCA operators.
Comment: Establishing core concepts of HoCA
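The cube-crawling operator described above can be approximated in a few lines (a hypothetical simplification for intuition only; the actual HoCA interface, region space, and RAM library are richer): enumerate regions as conjunctions of feature = value predicates and map each region to the signal a model computes over its matching rows.

```python
from itertools import combinations

def cube_crawl(rows, features, ram):
    """Toy sketch of cube crawling: enumerate regions (conjunctions of
    feature = value predicates over non-empty subsets of `features`)
    and map each region to the signal that the region analysis model
    `ram` computes from the rows it matches."""
    signals = {}
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            buckets = {}
            for row in rows:
                region = tuple((f, row[f]) for f in subset)
                buckets.setdefault(region, []).append(row)
            for region, matched in buckets.items():
                signals[region] = ram(matched)
    return signals
```

For instance, crawling rows with an `os` feature and a summing RAM maps each region such as `(('os', 'linux'),)` to the aggregate over its matching rows; exploiting region-space structure to avoid enumerating all subsets is exactly the optimization space the abstract mentions.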